Trust Region Policy Optimization
Authors
Abstract
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
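The abstract does not spell out the update rule, so the following is only a minimal sketch of the trust-region idea behind TRPO, under stated assumptions: policies are small discrete action distributions with strictly positive probabilities, the trust region is measured by the mean KL divergence between the old and the candidate policy, and a candidate is accepted only if it stays inside the region and improves the importance-weighted surrogate objective. The function names and the threshold `max_kl=0.01` are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (surrogate objective, mean KL(old || new)) over sampled states.

    old_probs / new_probs: per-state action distributions (lists of floats,
    assumed strictly positive for the new policy).
    actions: index of the action taken in each sampled state.
    advantages: advantage estimate for each (state, action) sample.
    """
    n = len(actions)
    surrogate = 0.0
    kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance-weighted advantage of the sampled action.
        surrogate += (p_new[a] / p_old[a]) * adv
        # KL divergence between the old and new action distributions.
        kl += sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)
    return surrogate / n, kl / n


def accept_step(old_probs, new_probs, actions, advantages, max_kl=0.01):
    """Accept a candidate policy only if it stays within the trust region
    (mean KL at most max_kl, an illustrative value) and improves the surrogate."""
    base, _ = surrogate_and_kl(old_probs, old_probs, actions, advantages)
    cand, kl = surrogate_and_kl(old_probs, new_probs, actions, advantages)
    return kl <= max_kl and cand > base
```

In this sketch, a small shift of probability toward an action with positive advantage is accepted, while a large shift is rejected because it leaves the trust region, even though it would raise the surrogate further.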
Similar resources
Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Trust region methods, such as TRPO, are often used to stabilize policy optimization algorithms in reinforcement learning (RL). While current trust region strategies are effective for continuous control, they typically require a prohibitively large amount of on-policy interaction with the environment. To address this problem, we propose an off-policy trust region method, Trust-PCL. The algorithm ...
Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Trust region methods, such as TRPO, are often used to stabilize policy optimization algorithms in reinforcement learning (RL). While current trust region strategies are effective for continuous control, they typically require a large amount of on-policy interaction with the environment. To address this problem, we propose an off-policy trust region method, Trust-PCL, which exploits an observati...
Stochastic Variance Reduction for Policy Gradient Estimation
Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) technique [1] to model-free p...
On- and Off-Policy Monotonic Policy Improvement
Monotonic policy improvement and off-policy learning are two main desirable properties for reinforcement learning algorithms. In this study, we show that monotonic policy improvement is guaranteed from on- and off-policy mixture data. Based on the theoretical result, we provide an algorithm which uses the experience replay technique for trust region policy optimization. The proposed method ca...
A Trust-Region Method Using Extended Nonmonotone Technique for Unconstrained Optimization
In this paper, we present a nonmonotone trust-region algorithm for unconstrained optimization. We first introduce a variant of the nonmonotone strategy proposed by Ahookhosh and Amini [AhA 01] and incorporate it into the trust-region framework to construct a more efficient approach. Our new nonmonotone strategy combines the current function value with the maximum function values in some pri...